Linear Regression Analysis

Dr. Thiyanga S. Talagala
Department of Statistics, Faculty of Applied Sciences
University of Sri Jayewardenepura, Sri Lanka

What is Regression Analysis?

  • Statistical technique for investigating and modelling the relationship between variables.

Statistical Modelling

  • A simplified, mathematically formalized way to approximate reality (i.e. what generates your data) and, optionally, to make predictions from this approximation.

  • Regression Analysis involves curve fitting.

  • Curve fitting: The process of finding a relation or equation of best fit.

Model

\[Y = f(x_1, x_2, x_3) + \epsilon\]

Goal: Estimate \(f\).

How do we estimate \(f\)?

Non-parametric methods

Estimate \(f\) from observed data without making explicit assumptions about the functional form of \(f\).

Parametric methods

Estimate \(f\) from observed data by assuming a functional form for \(f\).

Ex: \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \epsilon\)
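As a rough illustration of the parametric route, the sketch below simulates data from the example model and recovers the coefficients. It assumes least squares as the estimation method (this section has not yet named one), and the coefficient values, sample size and noise level are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" coefficients, chosen only for this illustration:
# beta_0, beta_1, beta_2, beta_3
beta_true = np.array([1.0, 2.0, -0.5, 0.3])

n = 500
X = rng.normal(size=(n, 3))            # columns play the role of x_1, x_2, x_3
eps = rng.normal(scale=0.5, size=n)    # error term epsilon
y = beta_true[0] + X @ beta_true[1:] + eps

# Design matrix: a leading column of ones carries the intercept beta_0.
design = np.column_stack([np.ones(n), X])

# Least-squares estimate of (beta_0, beta_1, beta_2, beta_3).
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print("true     :", beta_true)
print("estimated:", np.round(beta_hat, 3))
```

Because the functional form is assumed, estimating \(f\) reduces to estimating four numbers.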


Do not under-estimate the power of simple models.

Create something new which is more efficient than the existing method.

Pearson’s Correlation Coefficient

  • Measures the strength of the linear relationship between two quantitative variables.

  • Takes a value between -1 and +1. A value of -1 indicates a perfect negative linear relationship, +1 a perfect positive linear relationship, and 0 no linear relationship.

  • Does not completely characterize the relationship between the two variables.

  • Is very sensitive to outliers.
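The properties above can be checked numerically. The following is a small sketch with made-up data; `pearson_r` is a helper written for this illustration, not a library function.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.arange(1.0, 11.0)                    # 1, 2, ..., 10

# Points lying exactly on a line give the extreme values.
r_perfect_pos = pearson_r(x, 2 * x + 1)     # increasing line
r_perfect_neg = pearson_r(x, -3 * x + 5)    # decreasing line

# A strong but non-linear (symmetric quadratic) relationship:
# r only measures the linear part, so it misses this entirely.
x_sym = np.arange(-5.0, 6.0)                # -5, ..., 5
r_quadratic = pearson_r(x_sym, x_sym ** 2)

# Replace one point on the increasing line with an outlier.
y_outlier = 2 * x + 1
y_outlier[-1] = -50.0
r_outlier = pearson_r(x, y_outlier)

print(r_perfect_pos, r_perfect_neg, r_quadratic, r_outlier)
```

In this toy data set a single outlying point is enough to change the sign of the coefficient, which is why a scatterplot should accompany any reported correlation.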

Variance and Standard Deviations

\[ \sigma^2 = \frac{\sum_{i=1}^N (x_i-\mu_x)^2}{N} \]

\[ \sigma = \sqrt{\frac{\sum_{i=1}^N (x_i-\mu_x)^2}{N}} \]

Covariance

\[ cov(x,y) = \frac{\sum_{i=1}^N (x_i-\mu_x)(y_i-\mu_y)}{N} \]
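A short sketch of the three formulas above, applied to a small made-up data set. The divisor is \(N\) (population form), which matches NumPy's defaults for `np.var` and `np.std` and `bias=True` for `np.cov`. The last line uses the standard identity that Pearson's coefficient equals the covariance divided by the product of the standard deviations.

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0])
N = len(x)

mu_x, mu_y = x.mean(), y.mean()

# Population variance and standard deviation of x (divide by N).
var_x = ((x - mu_x) ** 2).sum() / N
sd_x = np.sqrt(var_x)

# Population covariance of x and y (divide by N).
cov_xy = ((x - mu_x) * (y - mu_y)).sum() / N

# Pearson's r as standardized covariance.
sd_y = np.sqrt(((y - mu_y) ** 2).sum() / N)
r_xy = cov_xy / (sd_x * sd_y)

print(var_x, sd_x, cov_xy, r_xy)
```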

Your turn: Create a geometrical demonstration

Terminologies

  • Response variable: dependent variable

  • Explanatory variables: independent variables, predictors, regressor variables, features (in Machine Learning)

Simple Linear Regression

Simple - a single regressor.

Linear - the parameters enter in a linear fashion; the model need not be linear in the regressors.

What about this?

\[Y = \beta_0 + \beta_1x_1 + \beta_{2}x_2 + \epsilon\]

Linear or nonlinear?

\[Y = \beta_0 + \beta_1x + \beta_{2}x^2 + \epsilon\]

Linear or nonlinear?

\[Y = \beta_0e^{\beta_1x} + \epsilon\]

What about this?

\[Y = \alpha X_1^\beta X_2^\gamma X_3^\delta e^\epsilon\]
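One way to reason about the four models above, offered as a worked hint: check whether each model is linear in its parameters. The first two are linear in the \(\beta\)s (in the second, \(x^2\) simply acts as another regressor). The third is not, because \(\beta_1\) sits inside the exponential. The fourth is not linear as written, but, assuming \(Y\), \(\alpha\) and all \(X_j\) are positive, taking logarithms of both sides gives

\[\log Y = \log\alpha + \beta\log X_1 + \gamma\log X_2 + \delta\log X_3 + \epsilon\]

which is linear in \(\log\alpha\), \(\beta\), \(\gamma\) and \(\delta\). The same trick does not work for the third model, since its error term is additive rather than multiplicative.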

True relationship between X and Y in the population

\[Y = f(X) + \epsilon\]

If \(f\) is approximated by a linear function

\[Y = \beta_0 + \beta_1X + \epsilon\]

Assume the error terms are normally distributed with mean \(0\) and variance \(\sigma^2\). Then the mean response, \(E(Y|X=x_i)\), at any value \(x_i\) of \(X\) is

\[E(Y|X=x_i) = E(\beta_0 + \beta_1x_i + \epsilon)=\beta_0+\beta_1x_i\]

For a single unit \((y_i, x_i)\)

\[y_i = \beta_0 + \beta_1x_i+\epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]
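A quick simulation can make the mean-response statement concrete. This is only a sketch: the parameter values and the chosen \(x\) are hypothetical. Many responses are drawn at one fixed \(x\), and their average is compared with \(\beta_0 + \beta_1 x\).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters, chosen for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0
x_fixed = 4.0

# Draw many responses at the same x: y = beta0 + beta1 * x + eps,
# with eps ~ N(0, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=100_000)
y = beta0 + beta1 * x_fixed + eps

theoretical_mean = beta0 + beta1 * x_fixed   # E(Y | X = x_fixed)
simulated_mean = y.mean()

print(theoretical_mean, simulated_mean)
```

Individual responses scatter around the line with standard deviation \(\sigma\); only their average settles on it.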

We use sample values \((y_i, x_i)\), where \(i = 1, 2, \ldots, n\), to estimate \(\beta_0\) and \(\beta_1\).

The fitted regression model is

\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\]
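A minimal sketch of how the fitted model might be obtained from a sample, assuming ordinary least squares as the estimation method (the slides have not introduced an estimator at this point). The data are simulated with hypothetical parameter values, and the closed-form estimates used are the usual least-squares ones: slope = \(S_{xy}/S_{xx}\), intercept = \(\bar{y} - \hat{\beta}_1\bar{x}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population parameters used to simulate a sample.
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)

# Closed-form least-squares estimates for simple linear regression.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and residuals.
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

print("beta0_hat:", round(beta0_hat, 3), "beta1_hat:", round(beta1_hat, 3))
```

The estimates will differ from the population values from sample to sample; `np.polyfit(x, y, 1)` should return the same slope and intercept and can serve as a cross-check.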

Population Regression

\[E(Y|X=x_i) = \beta_0+\beta_1x_i\]

Population mean

Population mean (red) and sample mean (green)

Dashboard: https://statisticsmart.shinyapps.io/SimpleLinearRegression/

Different types of regression models

  1. Linear Regression

  2. Quantile Regression

  3. Piece-wise (Segmented) Regression

  4. LOESS (Locally Estimated Scatterplot Smoothing)

  5. Hodrick-Prescott (HP) Filter

  6. Multivariate Regression